PE File Clustering and Yara Signature Generation

Snap into a Yara sig!

Once upon a time we looked at classifying PE and Mach-O files. This time it's flipped on its head. Is it possible to use various clustering algorithms to group similar files together? But, why stop there!? Can we crank up the awesome and use information from those clusters to generate Yara signatures to find files that are similar in nature?

In this notebook we'll explore not only gathering static information from PE files, but also clustering on those attributes, and finally we'll show off the capabilities of Yara signature generation.

Tools

What we did:

  • Gathered data about PE files with pefile (JSON; see the sketch after this list)
  • Read that data in
  • Data cleanup
  • Explored the Data!
    • Graphs, clustering
  • Analyze Results
  • Yara signatures
  • More clustering
  • More analyzing
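
The data-gathering step itself isn't shown in this notebook, but if you want to produce something like our *.results files yourself, a minimal sketch with pefile might look like the following. Treat the key names and layout as an approximation; the actual collection script that produced the 'verbose' -> 'pefile' structure used below may differ.

# Hypothetical sketch (not the actual collection script): dump a PE's header
# info to JSON in roughly the shape extract_features() below expects.
import json
import pefile

def dump_pe_to_json(path):
    pe = pefile.PE(path)
    fh = pe.FILE_HEADER
    oh = pe.OPTIONAL_HEADER
    info = {'verbose': {'pefile': {
        'file header': {
            'machine': fh.Machine,
            'number of sections': fh.NumberOfSections,
            'compile date': fh.TimeDateStamp,
            'pointer to symbol table': fh.PointerToSymbolTable,
            'number of symbols': fh.NumberOfSymbols,
            'size of optional header': fh.SizeOfOptionalHeader,
            'characteristics': fh.Characteristics,
        },
        'optional header': {
            'magic': oh.Magic,
            'image base': oh.ImageBase,
            'size of image': oh.SizeOfImage,
            'entry point address': oh.AddressOfEntryPoint,
            'size of code': oh.SizeOfCode,
            'checksum': oh.CheckSum,
            'subsystem': oh.Subsystem,
        },
        # directory names here are pefile's, not necessarily the labels used below
        'data directories': dict((d.name, {'rva': d.VirtualAddress, 'size': d.Size})
                                 for d in oh.DATA_DIRECTORY),
    }}}
    with open(path + '.results', 'w') as f:
        f.write(json.dumps(info))

dump_pe_to_json('some_sample.exe')  # hypothetical sample path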

In [28]:
# All the imports and some basic level setting with various versions
import IPython
import re
import os
import json
import time
import string
import pandas
import pickle
import struct
import socket
import collections
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pefile

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

print "IPython version: %s" %IPython.__version__
print "pandas version: %s" %pd.__version__
print "numpy version: %s" %np.__version__

%matplotlib inline


IPython version: 2.1.0
pandas version: 0.13.1
numpy version: 1.8.1

In [3]:
def get_lang_value(lang):
    for key, value in pefile.LANG.iteritems():
        if value == lang:
            return key
    return 0

In [4]:
# Grab from the json data what we want
def extract_features(filename, data):
    feature = {}
    
    feature['filename'] = filename[26:-8]
        
    feature.update(data['verbose']['pefile']['file header'])
    feature.update(data['verbose']['pefile']['optional header'])
    feature['image base'] = float(feature['image base'])
    feature['size of stack reserve'] = float(feature['size of stack reserve'])
    feature['size of stack commit'] = float(feature['size of stack commit'])
    feature['size of heap reserve'] = float(feature['size of heap reserve'])
    feature['size of heap commit'] = float(feature['size of heap commit'])
    if 'size of image base var' in feature:
        del feature['size of image base var']

    if 'data directories' in data['verbose']['pefile']:
        for k,v in data['verbose']['pefile']['data directories'].iteritems():
            feature['data dir ' + k + ' rva'] = v['rva']
            feature['data dir ' + k + ' size'] = v['size']
 
    '''
    if 'sections' in data['verbose']['pefile']:
        for idx, sec in enumerate(data['verbose']['pefile']['sections']):
            feature['section ' + str(idx) + ' virtual address'] = sec['virtual address']
            feature['section ' + str(idx) + ' virtual size'] = sec['virtual size']

            if idx == 2:
                break
    '''
    if 'resources' in data['verbose']['pefile']:
        feature['number of resources'] = len(data['verbose']['pefile']['resources'])
        for index, resource in enumerate(data['verbose']['pefile']['resources']):
            feature['resource ' + str(index) + ' lang'] = get_lang_value(resource['lang'])
            feature['resource ' + str(index) + ' size'] = resource['size']
            feature['resource ' + str(index) + ' rva'] = resource['rva']

            if index == 2:
                break
    return feature

In [5]:
def extract_vtdata(filename, data):
    vt = {}
    vt['filename'] = filename[26:-7]
    if 'scans' in data:
        if data['positives'] > 0:
            vt['label'] = 'malicious'
        else:
            vt['label'] = 'nonmalicious'
        vt['positives'] = data['positives']
        if 'Symantec' in data['scans']:
            vt['symantec'] = data['scans']['Symantec']['result']
        if 'Sophos' in data['scans']:
            vt['sophos'] = data['scans']['Sophos']['result']
        if 'F-Prot' in data['scans']:
            vt['f-prot'] = data['scans']['F-Prot']['result']
        if 'Kaspersky' in data['scans']:
            vt['kaspersky'] = data['scans']['Kaspersky']['result']
        if 'McAfee' in data['scans']:
            vt['mcafee'] = data['scans']['McAfee']['result']
        if 'Malwarebytes' in data['scans']:
            vt['malwarebytes'] = data['scans']['Malwarebytes']['result']
    else:
        vt['label'] = 'nonmalicious'
        vt['positives'] = 0
    return vt

In [6]:
def load_files(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_features(filename, json.loads(f.read()))
            features_list.append(features)
    return features_list

import glob
file_list = glob.glob('pefile_clustering_bsidelv/*.results')
features = load_files(file_list)
print "Files:", len(file_list)


Files: 1000

In [7]:
def load_vt_data(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_vtdata(filename, json.loads(f.read()))
            features_list.append(features)
    return features_list

import glob
file_list = glob.glob('pefile_clustering_bsidelv/*.vtdata')
vt_data = load_vt_data(file_list)

In [8]:
df = pd.DataFrame.from_records(features)
for col in df.columns:
    if 'resource' in col[0:7]:
        df[col].fillna(-1, inplace=True)
        
df.fillna(-1, inplace=True)
df.head(5)


Out[8]:
base of code base of data characteristics checksum compile date data dir base relocation rva data dir base relocation size data dir debug rva data dir debug size data dir exception table rva data dir exception table size data dir export table rva data dir export table size data dir import address table rva data dir import address table size data dir import table rva data dir import table size data dir resource table rva data dir resource table size data dir tls table rva
0 4096 483328 8462 0 1357217520 835584 28956 0 0 0 0 633648 76 483328 1688 624872 260 811008 24256 0 ...
1 4096 270336 8450 408181 1365113048 380928 12380 5472 28 0 0 267296 134 4096 1292 260852 260 274432 103416 0 ...
2 4096 28672 271 2018515944 1260053446 0 0 0 0 0 0 0 0 28672 652 29604 180 208896 16944 0 ...
3 4096 28672 271 5723026 1260053452 0 0 0 0 0 0 0 0 28672 652 29872 180 299008 131024 0 ...
4 4096 151552 8450 313813 1306975492 241664 10008 0 0 0 0 195360 1163 151552 1128 189700 220 221184 19560 0 ...

5 rows × 63 columns


In [9]:
df_vt = pd.DataFrame.from_records(vt_data)
df_vt.fillna('No detection', inplace=True)
df_vt.head(5)


Out[9]:
f-prot filename kaspersky label malwarebytes mcafee positives sophos symantec
0 W32/Agent.EW.gen!Eldorado 0027e07dccc0ecbd051591607262bfd5d856adecf986e6... No detection malicious No detection Artemis!A9D90198DF20 35 No detection No detection
1 No detection 011809e9e92f82018c0e2425fa976d071b7acbff7a342d... No detection nonmalicious No detection No detection 0 No detection No detection
2 No detection 0130326c71bc3fe20fe13e7e2aac753fb6c178a4c1dd50... No detection malicious PUP.Optional.Domalq Artemis!A2ABED494338 21 DomainIQ pay-per install Trojan.ADH
3 W32/Trojan3.IUT 013d8c4ea64f1c5ce424ae224ae65adfd11a3982d3faba... not-a-virus:AdWare.Win32.Lyckriks.cw malicious PUP.Optional.OpenCandy.A Adware-OpenCandy.dll 28 Generic PUA LN WS.Reputation.1
4 No detection 01a8d23e2b114162262eaffd1c311450b56efdf3063372... No detection nonmalicious No detection No detection 0 No detection No detection

5 rows × 9 columns


In [10]:
cols = [x for x in df.columns.tolist() if x != 'filename']

Let's look at the raw data. But first, since we humans can't really visualize things with over 60 dimensions, we can use PCA to project all those features down to a few so that we can graph them. In this case, we'll look at 2D and 3D images that represent the data. It's interesting to see how much information is lost between 3D and 2D... too bad we can't see how the data would look in all the dimensions!


In [11]:
X = df.as_matrix(cols)
from sklearn.preprocessing import scale
X = scale(X)

from sklearn.decomposition import PCA
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)

In [12]:
from mpl_toolkits.mplot3d import Axes3D

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], s=50)
ax.set_title("Features in 3D")
ax = fig.add_subplot(1, 2, 2)
ax.scatter(DD[:,0], DD[:,1], s=50)
ax.set_title("Features in 2D")
plt.show()


First up is DBSCAN; it enjoys long walks on the beach, non-flat geometry, and uneven cluster sizes (http://scikit-learn.org/stable/modules/clustering.html). This seemed like a good selection for many different reasons. We expect to have several uneven cluster sizes, since this sample of files contains both malware and nonmalicious binaries. By building the features from the file structure, this should pick out several different tool chains (compilers, etc...) used, and it would be surprising to have even distributions of that type of information in the data set. Hopefully we will even be able to cluster malware families together. Another nice feature of the scikit-learn implementation is that all samples that don't belong to a cluster are labeled with "-1". This avoids shoving files into clusters and reducing the efficiency of any generated Yara signature. However, if we're searching for more generic sigs we can play games to get more samples into clusters or use different algorithms.

We also show the difference between clustering on the raw data and on scaled, PCA-reduced data, and how you can get different (and usually better) results by scaling and reducing.


In [17]:
from sklearn.cluster import DBSCAN
X = df.as_matrix(cols)

dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

dbscan_df = df[['filename','cluster']]

print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()


Number of clusters: 7
Labeled samples: 36
Unlabeled samples: 964

We can see that without scaling and PCA just about everything is unlabeled. Let's try again, this time using PCA. First we determine how many dimensions to reduce to, then we cluster.


In [18]:
X = df.as_matrix(cols)
X = scale(X)
pca = PCA().fit(X)
n_comp = len([x for x in pca.explained_variance_ if x > 1e0])
print "Number of components w/explained variance > 1: %s" % n_comp


Number of components w/explained variance > 1: 20

In [19]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)

dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)

dbscan_df = df[['filename','cluster']]

print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()


Number of clusters: 63
Labeled samples: 418
Unlabeled samples: 582

Over half the files ended up unclustered, which is a little disappointing, but it's still a huge improvement.


In [20]:
dbscan_df.cluster.value_counts().head(10)


Out[20]:
-1     582
 9      58
 5      32
 11     30
 4      23
 19     19
 2      16
 6      13
 23     10
 29      9
dtype: int64

Let's see these clusters in 3D and 2D now.


In [21]:
# Remove unlabeled samples for graphing to make it prettier
tempdf = df[df['cluster'] != -1].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)

figsize(12,12)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(2, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 2, projection='3d')
ax.set_xlim(-5,5)
ax.set_ylim(-5,15)
ax.set_zlim(-5,5)
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
ax = fig.add_subplot(2, 2, 3)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 4)
ax.set_xlim(-3,4)
ax.set_ylim(-5,7)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
plt.show()


Let's see how well DBSCAN did. To this end, we use data from VirusTotal to help us.


In [22]:
dbscan_vt_df = pd.merge(dbscan_df, df_vt, on='filename', how='outer')
dbscan_vt_df.head()


Out[22]:
filename cluster f-prot kaspersky label malwarebytes mcafee positives sophos symantec
0 0027e07dccc0ecbd051591607262bfd5d856adecf986e6... 42 W32/Agent.EW.gen!Eldorado No detection malicious No detection Artemis!A9D90198DF20 35 No detection No detection
1 011809e9e92f82018c0e2425fa976d071b7acbff7a342d... -1 No detection No detection nonmalicious No detection No detection 0 No detection No detection
2 0130326c71bc3fe20fe13e7e2aac753fb6c178a4c1dd50... 25 No detection No detection malicious PUP.Optional.Domalq Artemis!A2ABED494338 21 DomainIQ pay-per install Trojan.ADH
3 013d8c4ea64f1c5ce424ae224ae65adfd11a3982d3faba... -1 W32/Trojan3.IUT not-a-virus:AdWare.Win32.Lyckriks.cw malicious PUP.Optional.OpenCandy.A Adware-OpenCandy.dll 28 Generic PUA LN WS.Reputation.1
4 01a8d23e2b114162262eaffd1c311450b56efdf3063372... 7 No detection No detection nonmalicious No detection No detection 0 No detection No detection

5 rows × 10 columns

Below, we can see that most of the clusters do not mix malicious and nonmalicious samples; that's a good start. And looking further at some of the malicious clusters, we can see that DBSCAN is doing a pretty good job of grouping families together.

Hooray, it's useful!


In [23]:
clusters = set()
print "Total Number of Clusters: %s\n" % (len(dbscan_vt_df['cluster'].unique().tolist()))
for name, blah in dbscan_vt_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


Total Number of Clusters: 63

-1.0 Cluster has both Malicious and Non-Malicious Samples
0.0 Cluster has both Malicious and Non-Malicious Samples
7.0 Cluster has both Malicious and Non-Malicious Samples
12.0 Cluster has both Malicious and Non-Malicious Samples
26.0 Cluster has both Malicious and Non-Malicious Samples
27.0 Cluster has both Malicious and Non-Malicious Samples
31.0 Cluster has both Malicious and Non-Malicious Samples
49.0 Cluster has both Malicious and Non-Malicious Samples

In [24]:
dbscan_cluster_results = dbscan_vt_df.groupby(['cluster', 'label']).count()
dbscan_cluster_results[['filename']].head(10)


Out[24]:
filename
cluster label
-1 malicious 245
nonmalicious 337
0 malicious 1
nonmalicious 2
1 malicious 3
2 nonmalicious 16
3 nonmalicious 5
4 nonmalicious 23
5 malicious 32
6 nonmalicious 13

10 rows × 1 columns


In [25]:
dbscan_vt_df[dbscan_vt_df['filename'] == 'dc2ecab3759956a2c87da411c1ecce32fe2b71d8ade00d0dadbd460de91b411c']


Out[25]:
filename cluster f-prot kaspersky label malwarebytes mcafee positives sophos symantec
816 dc2ecab3759956a2c87da411c1ecce32fe2b71d8ade00d... 29 W32/Vobfus.AA.gen!Eldorado Worm.Win32.WBNA.bmf malicious Trojan.Downloader.ic Generic VB.kk 48 Mal/SillyFDC-T W32.Changeup!gen15

1 rows × 10 columns


In [26]:
cluster_dc2 = dbscan_vt_df[dbscan_vt_df['cluster'] == 29]
cluster_dc2[['f-prot', 'mcafee', 'symantec', 'sophos', 'kaspersky', 'malwarebytes']]


Out[26]:
f-prot mcafee symantec sophos kaspersky malwarebytes
59 W32/VB.ID.gen!Eldorado W32/Autorun.worm.aaeh W32.Changeup Troj/Agent-ZQE Worm.Win32.Vobfus.atyr No detection
188 W32/Vobfus.AI.gen!Eldorado Generic VB.kk W32.Changeup!gen15 Mal/SillyFDC-U Trojan.Win32.VBKrypt.izdo Worm.Obfuscated
327 W32/Vobfus.BE.gen!Eldorado Generic Downloader.oq W32.Changeup Mal/SillyFDC-W Worm.Win32.WBNA.bul Worm.Obfuscated
369 W32/Vobfus.O.gen!Eldorado VBObfus.dv W32.Changeup W32/Autorun-BWV Worm.Win32.WBNA.ipa Worm.Obfuscated
503 W32/Vobfus.AA.gen!Eldorado Generic VB.kk W32.Changeup Mal/SillyFDC-T Worm.Win32.WBNA.bqi No detection
506 W32/Vobfus.AQ.gen!Eldorado VBObfus.dv W32.Changeup W32/Vobfus-AI Worm.Win32.WBNA.ipa Trojan.Downloader.ic
816 W32/Vobfus.AA.gen!Eldorado Generic VB.kk W32.Changeup!gen15 Mal/SillyFDC-T Worm.Win32.WBNA.bmf Trojan.Downloader.ic
873 W32/Vobfus.AI.gen!Eldorado Generic VB.kk W32.Changeup!gen17 Mal/VBCheMan-B Worm.Win32.WBNA.bul Worm.Obfuscated
921 W32/Vobfus.O.gen!Eldorado W32/Autorun.worm.aaeh W32.Changeup W32/Autorun-BWV Worm.Win32.WBNA.ipa Trojan.Downloader.ic

9 rows × 6 columns

Perfect, we've got our files clustered into groups that appear to be similar/close to one another. This is great if we wanted to stop here, but how many times can you run a Python model on machines in an enterprise or on an appliance in order to find similar files? Probably not very often. Instead we need to get this information out of Python and into something usable on files: Yara!

Below you'll see a simple call-out to a yara_signature Python module. This module contains code to generate a signature based on attributes found in the file. We've chosen a cluster (26) and a file from that cluster to base the signature on. Then the attributes that have a single, present (non-missing) value across the cluster are added to the signature as-is, while struct values that vary across the cluster can be partially wildcarded in the sig (a tiny illustration of this follows); that's the reason for the separate lists keeping track of the file-header and optional-header attributes.
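
To make the wildcarding concrete, here's a tiny standalone illustration (hypothetical values, not part of the pipeline below): two values of the same field are packed little-endian, and any hex nibble that doesn't agree across the samples becomes a '?' in the signature.

# Toy illustration of the nibble-wise wildcarding used in the cell below
import struct

values = [0x00051000, 0x00062000]  # e.g. two 'size of image' values from one cluster
hex_strs = [struct.pack("<I", v).encode("hex") for v in values]
wild = ''.join(c[0] if len(set(c)) == 1 else '?' for c in zip(*hex_strs))
print hex_strs   # ['00100500', '00200600']
print wild       # '00?00?00' -> common nibbles kept, differing nibbles wildcarded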


In [29]:
import yara_signature
import struct

name = 26
fdf = pd.DataFrame()
for f in dbscan_df[dbscan_df['cluster'] == name].filename.tolist():
    fdf = fdf.append(df[df['filename'] == f], ignore_index=True)
    
# Choose a signature from cluster to use as the basis of the sig w/the attributes below
filename = 'dc2ecab3759956a2c87da411c1ecce32fe2b71d8ade00d0dadbd460de91b411c'
meta = {"author" : "dorsey", "email" : "dorsey_at_clicksecurity_dot_com"}

sig = yara_signature.yara_pe_generator.YaraPEGenerator('./'+filename, samplename="Cluster_"+str(name), meta=meta)

file_header_columns = ["pointer to symbol table", "characteristics", "number of symbols", "size of optional header",
                        "machine", "compile date", "number of sections"]

optional_header_columns = ["subsystem", "major image version", "image base", "size of heap reserve",
                           "major operating system version", "section alignment", "loader flags",
                           "minor subsystem version", "major linker version", "size of stack commit",
                           "size of code", "size of image", "number of rva and sizes", "dll charactersitics",
                           "file alignment", "size of stack reserve", "minor linker version", "base of code",
                           "size uninit data", "entry point address", "size init data", "major subsystem version",
                           "magic", "checksum", "size of heap commit", "minor image version",
                           "minor operating system version", "size of headers", "base of data", "size of image base var",
                           "data dir base relocation rva", "data dir base relocation size", "data dir debug rva",
                           "data dir debug size", "data dir exception table rva", "data dir exception table size",
                           "data dir export table rva", "data dir export table size", "data dir import address table rva",
                           "data dir import address table rva", "data dir import address table size",
                           "data dir import table rva", "data dir import table size", "data dir import table size",
                           "data dir resource table rva", "data dir resource table size", "data dir tls table rva",
                           "data dir tls table size"]

file_header = []
optional_header = {}

for col in fdf.columns:
    if len(fdf[col].unique()) == 1:
        if fdf[col].unique()[0] != -1:
            lower = [s for s in col if s.islower()]
            if fdf[col].unique()[0] != -1 or (len(lower) == len(col)):
                if col in file_header_columns:
                    file_header.append(col)
                if col in optional_header_columns:
                    optional_header[col] = struct.pack("<I", int(fdf[col].unique()[0])).encode('hex')

    if len(fdf[col].unique()) > 1:
        if col not in optional_header_columns:
            continue

        if type(fdf[col].unique()[0]) == str or len(fdf[col].unique()) > 9:
            continue

        u = []
        z = []
        for value in fdf[col].unique():
            u.append(struct.pack("<I", value).encode("hex"))

        for d in zip(*u):
            match = True
            for idx in range(1,len(d)):
                if d[0] != d[idx]:
                    match = False
                    break
            if match:
                z.append(d[0])
            else:
                z.append('?')
        string = ''.join(z)
        if string != '????????':
            optional_header[col] = string

if len(file_header) > 0:
    sig.add_file_header(file_header)

if len(optional_header) > 0:
    sig.add_optional_header_with_values(optional_header)

print sig.get_signature()


rule Cluster_26
{
meta:
    author = "dorsey"
    email = "dorsey_at_clicksecurity_dot_com"
    generator = "This sweet yara sig generator!"

strings:
    $FileHeader = { 4c 01 ?? ?? ?? ?? ?? ?? 00 00 00 00 00 00 00 00 e0 00 }
    $OptionalHeader = { 0b 01 08 00 00 ?? 0? 00 00 ?? 0? 00 00 00 00 00 ?? ?? 0? 00 00 20 00 00 00 ?0 0? 00 00 00 40 00 00 20 00 00 00 02 00 00 04 00 00 00 ?? ?? ?? ?? 04 00 00 00 ?? ?? ?? ?? 00 ?0 0? 00 00 0? 00 00 00 00 00 00 02 00 ?? ?? 00 00 10 00 00 10 00 00 00 00 10 00 00 10 00 00 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 ?? ?? 0? 00 5? 00 00 00 00 ?0 0? 00 ?? ?? 0? 00 00 00 00 00 00 00 00 00 ?? ?? ?? ?? ?? ?? ?? ?? 00 ?0 0? 00 0c 00 00 00 ?? ?? 00 00 1c 00 00 00 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 00 00 00 00 00 00 00 00 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 00 20 00 00 08 00 00 00 }
condition:
    $FileHeader at 204 and
    $OptionalHeader at 224
}
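
Once we have a signature, the obvious next step is to see what it hits. Here's a minimal usage sketch with the yara-python module, assuming it's installed and pointing at a hypothetical directory of samples (this wasn't part of the original run):

# Hypothetical usage sketch: compile the generated rule and scan a directory
import os
import yara

rule_source = sig.get_signature()        # the rule text printed above
with open('cluster_26.yar', 'w') as f:   # save it for use outside of Python
    f.write(rule_source)

rules = yara.compile(source=rule_source)
sample_dir = 'samples/'                  # hypothetical directory of PE files
for fname in os.listdir(sample_dir):
    matches = rules.match(os.path.join(sample_dir, fname))
    if matches:
        print "%s matched %s" % (fname, [m.rule for m in matches])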

Since we've got one method of going from clusters to a Yara signature down, let's take a brief look at what happens to the cluster shapes/distributions with some other types of clustering algorithms.

Next up, KMeans. It will put every sample into a cluster, and for this algorithm the number of clusters needs to be specified up front. There are a bunch of ways to determine how many clusters to use; below we went with a simple rule of thumb from Wikipedia (http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). A more data-driven alternative is sketched just below.
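
One common alternative is the elbow method: fit KMeans for a range of k values and look for the point where the within-cluster sum of squares (the inertia_ attribute in scikit-learn) stops dropping quickly. A rough sketch, not run as part of this notebook:

# Rough elbow-method sketch for picking k (not part of the original run)
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

X_elbow = scale(df.as_matrix(cols))
ks = range(2, 40)
inertias = [KMeans(n_clusters=k).fit(X_elbow).inertia_ for k in ks]
plt.plot(ks, inertias)
plt.xlabel('k')
plt.ylabel('within-cluster sum of squares')
plt.title('Elbow curve')
plt.show()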


In [32]:
import math
from sklearn.cluster import KMeans
X = df.as_matrix(cols)
X = scale(X)
#rule of thumb of k = sqrt(#samples/2), thanks wikipedia :)
k_clusters = int(math.sqrt(int(len(X)/2)))

kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
kmeans_df = df[['filename', 'cluster']]

print "Number of clusters: %d" % nclusters


Number of clusters: 22

In [33]:
df.cluster.value_counts().head(10)


Out[33]:
18    362
9     186
6     136
4     130
0      75
1      68
7      21
21      4
11      2
16      2
dtype: int64

In [34]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("K-Means Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-4,-1)
ax.set_ylim(20,35)
ax.set_zlim(-3,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("K-Means Clusters (zoomed in)")
plt.show()



In [35]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)

# same k as above, from the rule of thumb of k = sqrt(#samples/2), thanks wikipedia :)
k_clusters = 22

kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
kmeans_df = df[['filename', 'cluster']]

print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print

X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-3,-1)
ax.set_ylim(20,35)
ax.set_zlim(-3,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters (zoomed in)")
plt.show()


Number of clusters: 22

Cluster/Sample Layout
2     370
7     182
15    134
21    127
5      75
0      54
1      28
19      8
12      4
9       4
dtype: int64

Above you can see how scaling and PCA lead to a bit more balanced layout of some of the clusters, but we've still got some outliers. Not a huge deal, just another way to slice and look at the data.

Let's see how KMeans did at clustering the files.


In [37]:
kmeans_vt_df = pd.merge(kmeans_df, df_vt, on='filename', how='outer')
kmeans_cluster_results = kmeans_vt_df.groupby(['cluster', 'label']).count()
kmeans_cluster_results[['filename']].head(10)


Out[37]:
filename
cluster label
0 malicious 37
nonmalicious 17
1 malicious 13
nonmalicious 15
2 malicious 185
nonmalicious 185
3 malicious 2
4 malicious 1
5 malicious 71
nonmalicious 4

10 rows × 1 columns


In [38]:
clusters = set()
print "Total Number of Clusters: %s\n" % (len(kmeans_vt_df['cluster'].unique().tolist()))
for name, blah in kmeans_vt_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])


Total Number of Clusters: 22

0 Cluster has both Malicious and Non-Malicious Samples
1 Cluster has both Malicious and Non-Malicious Samples
2 Cluster has both Malicious and Non-Malicious Samples
5 Cluster has both Malicious and Non-Malicious Samples
7 Cluster has both Malicious and Non-Malicious Samples
9 Cluster has both Malicious and Non-Malicious Samples
15 Cluster has both Malicious and Non-Malicious Samples
19 Cluster has both Malicious and Non-Malicious Samples
21 Cluster has both Malicious and Non-Malicious Samples

Below we're looking at MeanShift. scikit-learn is nice enough to tell us a bit about MeanShift use cases (many clusters, uneven cluster size, non-flat geometry). This seems to, once again, fit our data pretty well. Maybe we can get some better/different layouts of clusters here.


In [39]:
from sklearn.cluster import MeanShift, estimate_bandwidth

X = df.as_matrix(cols)
X = scale(X)

ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)

labels1 = ms1.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
meanshift_cluster_df = df[['filename', 'cluster']]

print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters


Estimated Bandwidth: 6.20207046217
Number of clusters: 52

In [40]:
tempdf = df[df['cluster'] != 0].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,-2)
ax.set_ylim(10,20)
ax.set_zlim(-5,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()



In [41]:
df.cluster.value_counts().head(10)


Out[41]:
0     910
3       8
6       8
1       7
2       6
4       4
5       4
28      4
10      2
36      2
dtype: int64

In [42]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)

ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)

labels1 = ms1.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
cluster_df = df[['filename', 'cluster']]

print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print

# Once again we can remove, in this case, the largest cluster for a less dense graph
tempdf = df[df['cluster'] != 0].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)

figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,-2)
ax.set_ylim(10,20)
ax.set_zlim(-5,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()


Estimated Bandwidth: 4.95944523855
Number of clusters: 42

Cluster/Sample Layout
0     921
3      13
1       8
2       6
4       4
5       4
31      3
9       2
8       2
30      2
dtype: int64


In [43]:
ms_vt_df = pd.merge(cluster_df, df_vt, on='filename', how='outer')
ms_cluster_results = ms_vt_df.groupby(['cluster', 'label']).count()
ms_cluster_results[['filename']].head(10)


Out[43]:
filename
cluster label
0 malicious 462
nonmalicious 459
1 malicious 1
nonmalicious 7
2 malicious 6
3 malicious 6
nonmalicious 7
4 nonmalicious 4
5 malicious 4
6 nonmalicious 2

10 rows × 1 columns

It seems we've run into a similar situation with MeanShift as with DBSCAN. Instead of samples being left unlabeled, we wound up with one cluster containing the vast majority of samples. Unfortunately, using PCA doesn't help very much, and most of the samples remain in that one large cluster.

Overall, it's important to see how using different algorithms can impact the end result, and to understand that impact when trying to transfer knowledge from one domain to another. This way it's possible to see how the various clustering techniques can lead to different Yara signatures, which will fire on different sets of files. When dealing with large amounts of malware, this is one way to group existing samples and detect new potential variants of the same family.

Good luck and happy hunting!